Realization of Binary Neural Networks in NAND Memory Arrays

ABSTRACT

Use of a NAND array architecture to realize a binary neural network (BNN) allows for matrix multiplication and accumulation to be performed within the memory array. A unit synapse storing a binary weight of a BNN is formed of a pair of series connected memory cells. A binary input is applied as a pattern of voltage values on a pair of word lines connected to the unit synapse to perform the multiplication of the input with the weight by determining whether or not the unit synapse conducts. The results of such multiplications are determined by a sense amplifier, with the results accumulated by a counter.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation application of U.S. patent application Ser. No. 16/368,347, filed Mar. 28, 2019, which claims priority from U.S. Provisional Application No. 62/702,713, filed Jul. 24, 2018, and is related to an application entitled “Realization of Neural Networks with Ternary Inputs and Binary Weights in NAND Memory Arrays” by Hoang et al., all of which are incorporated herein by reference.

BACKGROUND

Artificial neural networks are finding increasing usage in artificial intelligence and machine learning applications. In an artificial neural network, a set of inputs is propagated through one or more intermediate, or hidden, layers to generate an output. The layers connecting the input to the output are connected by sets of weights that are generated in a training or learning phase by determining a set of mathematical manipulations to turn the input into the output, moving through the layers calculating the probability of each output. Once the weights are established, they can be used in the inference phase to determine the output from a set of inputs. Although such neural networks can provide highly accurate results, they are extremely computationally intensive, and the data transfers involved in reading the weights connecting the different layers out of memory and transferring them into the processing units of a processing device can be quite intensive.

BRIEF DESCRIPTION OF THE DRAWINGS

Like-numbered elements refer to common components in the different figures.

FIG. 1 is a block diagram of one embodiment of a memory system connected to a host.

FIG. 2 is a block diagram of one embodiment of a Front End Processor Circuit. In some embodiments, the Front End Processor Circuit is part of a Controller.

FIG. 3 is a block diagram of one embodiment of a Back End Processor Circuit. In some embodiments, the Back End Processor Circuit is part of a Controller.

FIG. 4 is a block diagram of one embodiment of a memory package.

FIG. 5 is a block diagram of one embodiment of a memory die.

FIG. 6 illustrates a simple example of an artificial neural network.

FIG. 7A is a flowchart describing one embodiment of a process for training a neural network to generate a set of weights.

FIG. 7B is a flowchart describing one embodiment of a process for inference using a neural network.

FIG. 8 is a schematic representation of the use of matrix multiplication in a neural network.

FIG. 9 is a table illustrating the output of a binary neural network in response to the different input-weight combinations.

FIG. 10 illustrates an embodiment for a unit synapse cell for storing a binary weight in a pair of series connected memory cells.

FIG. 11 illustrates the distribution of threshold voltages for the storage of data states on a binary, or single level cell (SLC), memory.

FIGS. 12 and 13 illustrate an embodiment for implementing a binary neural network using a pair of series connected SLC memory cells as a unit synapse.

FIG. 14 illustrates the incorporation of the unit synapses into a NAND array.

FIGS. 15 and 16 consider an example of the computation of a dot-product for the binary neural network algebra and how to implement this using a counter based summation digital circuit for an SLC NAND binary neural network (BNN) embodiment.

FIG. 17 is a flowchart for one embodiment of a dot-product calculation using a binary neural network in inference.

FIG. 18 illustrates an embodiment of a summation circuit for an SLC NAND array to support binary neural networks.

FIG. 19 is a flowchart for one embodiment of a dot-product calculation using a binary neural network in inference, as illustrated in the tables of FIGS. 15 and 16 and the array architecture of FIG. 18.

FIGS. 20 and 21 illustrate an example of a neural network and its implementation through a NAND array.

FIG. 22 illustrates an example of a neural network and its implementation through a NAND array to achieve a high parallelism across NAND blocks by leveraging multiple blocks within a single plane.

FIG. 23 is a flowchart for one embodiment of a dot-product calculation similar to that of FIG. 17, but that incorporates the multi-block parallelism illustrated by FIG. 22.

FIG. 24 illustrates additional embodiments that can inference for the inputs of a neural network concurrently across multiple planes.

FIG. 25 illustrates an embodiment of plane pipelining for different neural network layers.

FIG. 26 illustrates an embodiment in which weights of different layers can be stored in the same block, same plane, or both.

DETAILED DESCRIPTION

To reduce the computational complexity and relax the memory requirements of neural networks, Binary Neural Networks (BNNs) have been introduced. In BNNs, the weights and inputs of the neural network are truncated into binary values (−1, +1) and the binary arithmetic simplifies multiplication and addition to XNOR and bit-count operations. The following disclosure presents techniques for exploiting the structure of NAND memory for the storage of the weights of binary neural networks and for the execution of the multiply-and-accumulate operations within the NAND memory.
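As an informal illustration of this binary arithmetic (a sketch only, not part of the embodiments described below; the function names and example vectors are hypothetical), the following shows how a dot product over {−1, +1} values reduces to an element-wise XNOR followed by a bit count once the values are mapped to {0, 1}:

```python
# Minimal sketch of BNN arithmetic: a dot product of {-1,+1} vectors
# computed with an XNOR and a bit count (popcount).

def bnn_dot_signed(inputs, weights):
    # Direct signed dot product over {-1, +1} values.
    return sum(i * w for i, w in zip(inputs, weights))

def bnn_dot_xnor(inputs, weights):
    # Map {-1, +1} -> {0, 1}, XNOR element-wise, count the 1s,
    # then convert the popcount back to the signed result: P = 2*CNT - S.
    bits_i = [(i + 1) // 2 for i in inputs]
    bits_w = [(w + 1) // 2 for w in weights]
    popcount = sum(1 for bi, bw in zip(bits_i, bits_w) if bi == bw)  # XNOR = 1 on match
    return 2 * popcount - len(inputs)

example_in = [-1, +1, +1, -1]
example_w  = [+1, +1, -1, -1]
assert bnn_dot_signed(example_in, example_w) == bnn_dot_xnor(example_in, example_w)
```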

Each binary weight is stored in a unit synapse formed of a pair of series connected binary memory cells, such as a pair of adjacent memory cells on a NAND string, where one of the memory cells is in a programmed state and the other in an erased state. Depending on which memory cell of the unit synapse is in the programmed state and which memory cell is in the erased state, the unit synapse will store either a −1 or a +1 weight. The binary input is then applied as a voltage pattern on the corresponding word lines, in which one of the word line pair is at a read voltage (for which only the erased state memory cell will conduct) and the other one of the word line pair is at a pass voltage (for which a memory cell in either state will conduct). Depending on which word line of the word line pair is at which value, the input will either be a −1 or a +1 input. By applying the input to the word line pair, the unit synapse (and corresponding NAND string) will either conduct or not, depending on whether or not the input and the weight match. The result can be determined by a sense amplifier connected to a corresponding bit line. By sequentially working through the input/unit synapse pairs along a NAND string and accumulating the results of the sense amplifier, the multiply-and-accumulate operations of propagating an input through a layer of a neural network can be performed. As the word lines of the array span multiple NAND strings, the operation can be performed concurrently for the binary weights of multiple unit synapses.

The degree of parallelism can be increased by the introduction of multi-bit sense amplifiers, so that the unit synapses from different memory blocks of the array can be sensed concurrently. Further increases in parallelism can be obtained by concurrent sensing on multiple planes and pipelining the output of one plane, corresponding to one layer of a neural network, to be the input of another plane, corresponding to the subsequent layer of a neural network.

FIG. 1 is a block diagram of one embodiment of a memory system 100 connected to a host 120. Memory system 100 can implement the technology proposed herein, where the neural network inputs or other data are received from the host 120. Depending on the embodiment, the inputs can be received from the host 120 and then provided to the memory packages 104 for inferencing on the weights previously programmed into the memory arrays of the memory packages 104. Many different types of memory systems can be used with the technology proposed herein. Example memory systems include solid state drives (“SSDs”), memory cards and embedded memory devices; however, other types of memory systems can also be used.

Memory system 100 of FIG. 1 comprises a Controller 102, non-volatile memory 104 for storing data, and local memory (e.g. DRAM/ReRAM) 106. Controller 102 comprises a Front End Processor (FEP) circuit 110 and one or more Back End Processor (BEP) circuits 112. In one embodiment FEP circuit 110 is implemented on an ASIC. In one embodiment, each BEP circuit 112 is implemented on a separate ASIC. In other embodiments, a unified controller ASIC can combine both the front end and back end functions. The ASICs for each of the BEP circuits 112 and the FEP circuit 110 are implemented on the same semiconductor such that the Controller 102 is manufactured as a System on a Chip (“SoC”). FEP circuit 110 and BEP circuit 112 both include their own processors. In one embodiment, FEP circuit 110 and BEP circuit 112 work as a master slave configuration where the FEP circuit 110 is the master and each BEP circuit 112 is a slave. For example, FEP circuit 110 implements a Flash Translation Layer (FTL) or Media Management Layer (MML) that performs memory management (e.g., garbage collection, wear leveling, etc.), logical to physical address translation, communication with the host, management of DRAM (local volatile memory) and management of the overall operation of the SSD (or other non-volatile storage system). The BEP circuit 112 manages memory operations in the memory packages/die at the request of FEP circuit 110. For example, the BEP circuit 112 can carry out the read, erase and programming processes. Additionally, the BEP circuit 112 can perform buffer management, set specific voltage levels required by the FEP circuit 110, perform error correction (ECC), control the Toggle Mode interfaces to the memory packages, etc. In one embodiment, each BEP circuit 112 is responsible for its own set of memory packages.

In one embodiment, non-volatile memory 104 comprises a plurality of memory packages. Each memory package includes one or more memory die. Therefore, Controller 102 is connected to one or more non-volatile memory die. In one embodiment, each memory die in the memory packages 104 utilizes NAND flash memory (including two dimensional NAND flash memory and/or three dimensional NAND flash memory). In other embodiments, the memory package can include other types of memory.

Controller 102 communicates with host 120 via an interface 130 that implements NVM Express (NVMe) over PCI Express (PCIe). For working with memory system 100, host 120 includes a host processor 122, host memory 124, and a PCIe interface 126 connected along bus 128. Host memory 124 is the host's physical memory, and can be DRAM, SRAM, non-volatile memory or another type of storage. Host 120 is external to and separate from memory system 100. In another embodiment, memory system 100 is embedded in host 120.

FIG. 2 is a block diagram of one embodiment of FEP circuit 110. FIG. 2 shows a PCIe interface 150 to communicate with host 120 and a host processor 152 in communication with that PCIe interface. The host processor 152 can be any type of processor known in the art that is suitable for the implementation. Host processor 152 is in communication with a network-on-chip (NOC) 154. A NOC is a communication subsystem on an integrated circuit, typically between cores in a SoC. NOCs can span synchronous and asynchronous clock domains or use unclocked asynchronous logic. NOC technology applies networking theory and methods to on-chip communications and brings notable improvements over conventional bus and crossbar interconnections. NOC improves the scalability of SoCs and the power efficiency of complex SoCs compared to other designs. The wires and the links of the NOC are shared by many signals. A high level of parallelism is achieved because all links in the NOC can operate simultaneously on different data packets. Therefore, as the complexity of integrated subsystems keep growing, a NOC provides enhanced performance (such as throughput) and scalability in comparison with previous communication architectures (e.g., dedicated point-to-point signal wires, shared buses, or segmented buses with bridges). Connected to and in communication with NOC 154 is the memory processor 156, SRAM 160 and a DRAM controller 162. The DRAM controller 162 is used to operate and communicate with the DRAM (e.g., DRAM 106). SRAM 160 is local RAM memory used by memory processor 156. Memory processor 156 is used to run the FEP circuit and perform the various memory operations. Also, in communication with the NOC are two PCIe Interfaces 164 and 166. In the embodiment of FIG. 2, the SSD controller will include two BEP circuits 112; therefore, there are two PCIe Interfaces 164/166. Each PCIe Interface communicates with one of the BEP circuits 112. In other embodiments, there can be more or less than two BEP circuits 112; therefore, there can be more than two PCIe Interfaces.

FEP circuit 110 can also include a Flash Translation Layer (FTL) or, more generally, a Media Management Layer (MML) 158 that performs memory management (e.g., garbage collection, wear leveling, load balancing, etc.), logical to physical address translation, communication with the host, management of DRAM (local volatile memory) and management of the overall operation of the SSD or other non-volatile storage system. The media management layer MML 158 may be integrated as part of the memory management that may handle memory errors and interfacing with the host. In particular, MML may be a module in the FEP circuit 110 and may be responsible for the internals of memory management. In particular, the MML 158 may include an algorithm in the memory device firmware which translates writes from the host into writes to the memory structure (e.g., 326 of FIG. 5 below) of a die. The MML 158 may be needed because: 1) the memory may have limited endurance; 2) the memory structure may only be written in multiples of pages; and/or 3) the memory structure may not be written unless it is erased as a block. The MML 158 understands these potential limitations of the memory structure which may not be visible to the host. Accordingly, the MML 158 attempts to translate the writes from the host into writes into the memory structure.

FIG. 3 is a block diagram of one embodiment of the BEP circuit 112. FIG. 3 shows a PCIe Interface 200 for communicating with the FEP circuit 110 (e.g., communicating with one of PCIe Interfaces 164 and 166 of FIG. 2). PCIe Interface 200 is in communication with two NOCs 202 and 204. In one embodiment the two NOCs can be combined into one large NOC. Each NOC (202/204) is connected to SRAM (230/260), a buffer (232/262), processor (220/250), and a data path controller (222/252) via an XOR engine (224/254) and an ECC engine (226/256). The ECC engines 226/256 are used to perform error correction, as known in the art. The XOR engines 224/254 are used to XOR the data so that data can be combined and stored in a manner that can be recovered in case there is a programming error. Data path controller 222 is connected to an interface module for communicating via four channels with memory packages. Thus, the top NOC 202 is associated with an interface 228 for four channels for communicating with memory packages and the bottom NOC 204 is associated with an interface 258 for four additional channels for communicating with memory packages. Each interface 228/258 includes four Toggle Mode interfaces (TM Interface), four buffers and four schedulers. There is one scheduler, buffer and TM Interface for each of the channels. The processor can be any standard processor known in the art. The data path controllers 222/252 can be a processor, FPGA, microprocessor or other type of controller. The XOR engines 224/254 and ECC engines 226/256 are dedicated hardware circuits, known as hardware accelerators. In other embodiments, the XOR engines 224/254 and ECC engines 226/256 can be implemented in software. The scheduler, buffer, and TM Interfaces are hardware circuits.

FIG. 4 is a block diagram of one embodiment of a memory package 104 that includes a plurality of memory die 292 connected to a memory bus (data lines and chip enable lines) 294. The memory bus 294 connects to a Toggle Mode Interface 296 for communicating with the TM Interface of a BEP circuit 112 (see e.g., FIG. 3). In some embodiments, the memory package can include a small controller connected to the memory bus and the TM Interface. The memory package can have one or more memory die. In one embodiment, each memory package includes eight or 16 memory die; however, other numbers of memory die can also be implemented. The technology described herein is not limited to any particular number of memory die.

FIG. 5 is a functional block diagram of one embodiment of a memory die 300. The components depicted in FIG. 5 are electrical circuits. In one embodiment, each memory die 300 includes a memory structure 326, control circuitry 310, and read/write circuits 328. Memory structure 326 is addressable by word lines via a row decoder 324 and by bit lines via a column decoder 332. The read/write circuits 328 include multiple sense blocks 350 including SB1, SB2, . . . , SBp (sensing circuitry) and allow a page of memory cells to be read or programmed in parallel. Commands and data are transferred between the Controller and the memory die 300 via lines 318. In one embodiment, memory die 300 includes a set of input and/or output (I/O) pins that connect to lines 318.

Control circuitry 310 cooperates with the read/write circuits 328 to perform memory operations (e.g., write, read, and others) on memory structure 326, and includes a state machine 312, an on-chip address decoder 314, and a power control circuit 316. State machine 312 provides die-level control of memory operations. In one embodiment, state machine 312 is programmable by software. In other embodiments, state machine 312 does not use software and is completely implemented in hardware (e.g., electrical circuits). In another embodiment, state machine 312 is replaced by a micro-controller. In one embodiment, control circuitry 310 includes buffers such as registers, ROM fuses and other storage devices for storing default values such as base voltages and other parameters.

The on-chip address decoder 314 provides an address interface between addresses used by Controller 102 to the hardware address used by the decoders 324 and 332. Power control module 316 controls the power and voltages supplied to the word lines and bit lines during memory operations. Power control module 316 may include charge pumps for creating voltages. The sense blocks include bit line drivers.

For purposes of this document, the phrase “one or more control circuits” refers to a controller, a state machine, a micro-controller and/or control circuitry 310, or other analogous circuits that are used to control non-volatile memory.

In one embodiment, memory structure 326 comprises a three dimensional memory array of non-volatile memory cells in which multiple memory levels are formed above a single substrate, such as a wafer. The memory structure may comprise any type of non-volatile memory that are monolithically formed in one or more physical levels of memory cells having an active area disposed above a silicon (or other type of) substrate. In one example, the non-volatile memory cells comprise vertical NAND strings with charge-trapping material such as described, for example, in U.S. Pat. No. 9,721,662, incorporated herein by reference in its entirety.

In another embodiment, memory structure 326 comprises a two dimensional memory array of non-volatile memory cells. In one example, the non-volatile memory cells are NAND flash memory cells utilizing floating gates such as described, for example, in U.S. Pat. No. 9,082,502, incorporated herein by reference in its entirety. Other types of memory cells (e.g., NOR-type flash memory) can also be used.

The exact type of memory array architecture or memory cell included in memory structure 326 is not limited to the examples above. Many different types of memory array architectures or memory technologies can be used to form memory structure 326. No particular non-volatile memory technology is required for purposes of the new claimed embodiments proposed herein. Other examples of suitable technologies for memory cells of the memory structure 326 include ReRAM memories, magnetoresistive memory (e.g., MRAM, Spin Transfer Torque MRAM, Spin Orbit Torque MRAM), phase change memory (e.g., PCM), and the like. Examples of suitable technologies for memory cell architectures of the memory structure 326 include two dimensional arrays, three dimensional arrays, cross-point arrays, stacked two dimensional arrays, vertical bit line arrays, and the like.

One example of a ReRAM, or PCMRAM, cross point memory includes reversible resistance-switching elements arranged in cross point arrays accessed by X lines and Y lines (e.g., word lines and bit lines). In another embodiment, the memory cells may include conductive bridge memory elements. A conductive bridge memory element may also be referred to as a programmable metallization cell. A conductive bridge memory element may be used as a state change element based on the physical relocation of ions within a solid electrolyte. In some cases, a conductive bridge memory element may include two solid metal electrodes, one relatively inert (e.g., tungsten) and the other electrochemically active (e.g., silver or copper), with a thin film of the solid electrolyte between the two electrodes. As temperature increases, the mobility of the ions also increases causing the programming threshold for the conductive bridge memory cell to decrease. Thus, the conductive bridge memory element may have a wide range of programming thresholds over temperature.

Magnetoresistive memory (MRAM) stores data by magnetic storage elements. The elements are formed from two ferromagnetic plates, each of which can hold a magnetization, separated by a thin insulating layer. One of the two plates is a permanent magnet set to a particular polarity; the other plate's magnetization can be changed to match that of an external field to store memory. A memory device is built from a grid of such memory cells. In one embodiment for programming, each memory cell lies between a pair of write lines arranged at right angles to each other, parallel to the cell, one above and one below the cell. When current is passed through them, an induced magnetic field is created.

Phase change memory (PCM) exploits the unique behavior of chalcogenide glass. One embodiment uses a GeTe-Sb2Te3 super lattice to achieve non-thermal phase changes by simply changing the co-ordination state of the Germanium atoms with a laser pulse (or light pulse from another source). Therefore, the doses of programming are laser pulses. The memory cells can be inhibited by blocking the memory cells from receiving the light. In other PCM embodiments, the memory cells are programmed by current pulses. Note that the use of “pulse” in this document does not require a square pulse but includes a (continuous or non-continuous) vibration or burst of sound, current, voltage, light, or other wave.

A person of ordinary skill in the art will recognize that the technology described herein is not limited to a single specific memory structure, but covers many relevant memory structures within the spirit and scope of the technology as described herein and as understood by one of ordinary skill in the art.

Turning now to types of data that can be stored on non-volatile memory devices, a particular example of the type of data of interest in the following discussion is the weights used in deep neural networks. An artificial neural network is formed of one or more intermediate layers between an input layer and an output layer. The neural network finds a mathematical manipulation to turn the input into the output, moving through the layers calculating the probability of each output. FIG. 6 illustrates a simple example of an artificial neural network.

In FIG. 6 an artificial neural network is represented as an interconnected group of nodes or artificial neurons, represented by the circles, and a set of connections from the output of one artificial neuron to the input of another. The example shows three input nodes (I₁, I₂, I₃) and two output nodes (O₁, O₂), with an intermediate layer of four hidden or intermediate nodes (H₁, H₂, H₃, H₄). The nodes, or artificial neurons/synapses, of the artificial neural network are implemented by logic elements of a host or other processing system as a mathematical function that receives one or more inputs and sums them to produce an output. Usually each input is separately weighted and the sum is passed through the node's mathematical function to provide the node's output.
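As a rough illustration of this node computation (a sketch only; the function name, example values, and the choice of a sign activation are assumptions for illustration, not taken from the embodiments), a node's output can be modeled as a weighted sum passed through a non-linear function:

```python
# Minimal sketch of a single artificial neuron: weighted sum + non-linear function.

def neuron_output(inputs, weights, activation=lambda s: 1 if s >= 0 else -1):
    # Each input is separately weighted, the weighted inputs are summed,
    # and the sum is passed through the node's (non-linear) function.
    weighted_sum = sum(i * w for i, w in zip(inputs, weights))
    return activation(weighted_sum)

# Example: a hidden node with three inputs (like H1 fed by I1, I2, I3 in FIG. 6).
print(neuron_output([0.5, -1.0, 0.25], [0.8, 0.1, -0.3]))
```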

In common artificial neural network implementations, the signal at a connection between nodes (artificial neurons/synapses) is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs. Nodes and their connections typically have a weight that adjusts as a learning process proceeds. The weight increases or decreases the strength of the signal at a connection. Nodes may have a threshold such that the signal is only sent if the aggregate signal crosses that threshold. Typically, the nodes are aggregated into layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times. Although FIG. 6 shows only a single intermediate or hidden layer, a complex deep neural network (DNN) can have many such intermediate layers.

An artificial neural network is “trained” by supplying inputs and then checking and correcting the outputs. For example, a neural network that is trained to recognize dog breeds will process a set of images and calculate the probability that the dog in an image is a certain breed. A user can review the results and select which probabilities the network should display (above a certain threshold, etc.) and return the proposed label. Each mathematical manipulation as such is considered a layer, and complex neural networks have many layers. Due to the depth provided by a large number of intermediate or hidden layers, neural networks can model complex non-linear relationships as they are trained.

FIG. 7A is a flowchart describing one embodiment of a process for training a neural network to generate a set of weights. The training process is often performed in the cloud, allowing additional or more powerful processing to be accessed. At step 701, the input, such as a set of images, is received at the input nodes (e.g., I₁, I₂, I₃ in FIG. 6). At step 703 the input is propagated through the nodes of the hidden intermediate layers (e.g., H₁, H₂, H₃, H₄ in FIG. 6) using the current set of weights. The neural network's output is then received at the output nodes (e.g., O₁, O₂ in FIG. 6) in step 705. In the dog breed example of the preceding paragraph, the input would be the image data of a number of dogs, and the intermediate layers use the current weight values to calculate the probability that the dog in an image is a certain breed, with the proposed dog breed label returned at step 705. A user can then review the results at step 707 to select which probabilities the neural network should return and decide whether the current set of weights supply a sufficiently accurate labelling and, if so, the training is complete (step 711). If the result is not sufficiently accurate, the neural network adjusts the weights at step 709 based on the probabilities the user selected, followed by looping back to step 703 to run the input data again with the adjusted weights. Once the neural network's set of weights have been determined, they can be used to “inference,” which is the process of using the determined weights to generate an output result from data input into the neural network. Once the weights are determined at step 711, they can then be stored in non-volatile memory for later use, where the storage of these weights in non-volatile memory is discussed in further detail below.
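The loop of FIG. 7A can be summarized in the following sketch, where the propagation, accuracy-check, and weight-adjustment functions are hypothetical placeholders standing in for whichever training algorithm is used:

```python
# Minimal sketch of the training loop of FIG. 7A (the callables are placeholders).

def train(inputs, weights, propagate, is_accurate_enough, adjust_weights):
    while True:
        outputs = propagate(inputs, weights)         # steps 701-705: run inputs through the layers
        if is_accurate_enough(outputs):              # step 707: review/evaluate the results
            return weights                           # step 711: training complete; store the weights
        weights = adjust_weights(weights, outputs)   # step 709: adjust, then loop back to step 703
```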

FIG. 7B is a flowchart describing a process for the inference phase of supervised learning using a neural network to predict the “meaning” of the input data using an estimated accuracy. Depending on the case, the neural network may be inferenced both in the cloud and by an edge device's (e.g., smart phone, automobile processor, hardware accelerator) processor. At step 721, the input is received, such as the image of a dog in the example used above. If the previously determined weights are not present in the device running the neural network application, they are loaded at step 722. For example, on a host processor executing the neural network, the weights could be read out of an SSD in which they are stored and loaded into RAM on the host device. At step 723, the input data is then propagated through the neural network's layers. Step 723 will be similar to step 703 of FIG. 7A, but now using the weights established at the end of the training process at step 711. After propagating the input through the intermediate layers, the output is then provided at step 725.

Neural networks are typically feedforward networks in which data flows from the input layer, through the intermediate layers, and to the output layer without looping back. At first, in the training phase of supervised learning as illustrated by FIG. 7A, the neural network creates a map of virtual neurons and assigns random numerical values, or “weights”, to connections between them. The weights and inputs are multiplied and return an output between 0 and 1. If the network does not accurately recognize a particular pattern, an algorithm adjusts the weights. That way the algorithm can make certain parameters more influential (by increasing the corresponding weight) or less influential (by decreasing the weight) and adjust the weights accordingly until it determines a set of weights that provide a sufficiently correct mathematical manipulation to fully process the data.

FIG. 8 is a schematic representation of the use of matrix multiplication in a neural network. Matrix multiplication, or MatMul, is a commonly used approach in both the training and inference phases for neural networks and is used in kernel methods for machine learning. FIG. 8 at top is similar to FIG. 6, where only a single hidden layer is shown between the input layer and the output layer. The input data is represented as a vector of a length corresponding to the number of input nodes. The weights are represented in a weight matrix, where the number of columns corresponds to the number of intermediate nodes in the hidden layer and the number of rows corresponds to the number of input nodes. The output is determined by a matrix multiplication of the input vector and the weight matrix, where each element of the output vector is a dot product of the vector of the input data with a column of the weight matrix.
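As a sketch of this matrix multiplication (the function and variable names, and the example values, are illustrative assumptions only), each output element is the dot product of the input vector with one column of the weight matrix:

```python
# Minimal sketch of propagating an input vector through one layer by matrix multiplication.

def layer_output(input_vec, weight_matrix):
    # weight_matrix[r][c]: rows correspond to input nodes, columns to hidden nodes,
    # so output[c] is the dot product of input_vec with column c of the weight matrix.
    num_rows = len(weight_matrix)
    num_cols = len(weight_matrix[0])
    return [sum(input_vec[r] * weight_matrix[r][c] for r in range(num_rows))
            for c in range(num_cols)]

# Example: 3 input nodes feeding 4 hidden nodes (a 3x4 weight matrix).
x = [1.0, 0.5, -1.0]
W = [[0.2, -0.1, 0.0, 0.3],
     [0.5, 0.4, -0.2, 0.1],
     [-0.3, 0.2, 0.6, -0.5]]
print(layer_output(x, W))  # vector of 4 dot products
```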

A common technique for executing the matrix multiplications is by use of a multiplier-accumulator (MAC, or MAC unit). However, this has a number of issues. Referring back to FIG. 7B, the inference phase loads the neural network weights at step 722 before the matrix multiplications are performed by the propagation at step 723. However, as the amount of data involved can be extremely large, use of a multiplier-accumulator for inferencing has several issues related to loading of weights. One of these is high energy dissipation due to having to use large MAC arrays with the required bit-width. Another is high energy dissipation due to the limited size of MAC arrays, resulting in high data movement between logic and memory and an energy dissipation that can be much higher than used in the logic computations themselves.

To help avoid these limitations, the use of a multiplier-accumulator array can be replaced with other memory technologies. For example, the matrix multiplication can be computed within a memory array by leveraging the characteristics of Storage Class Memory (SCM), such as those based on ReRAM, PCM, or MRAM based memory cells. This allows for the neural network inputs to be provided via read commands and the neural weights to be preloaded for inferencing. By use of in-memory computing, this can remove the need for logic to perform the matrix multiplication in the MAC array and the need to move data between the memory and the MAC array.

The following considers embodiments based on memory arrays using NAND type of architectures, such as flash NAND memory using memory cells with a charge storage region. Flash NAND memory can be implemented using both multi-level cell (MLC) structures and single-level cell (SLC) structures, where the following mainly considers embodiments based on SLC Flash memory. In contrast to MAC array logic, use of SLC Flash memory shows several advantages, including a much higher area/bit value, a much higher throughput rate, and a significant reduction in energy dissipation due to minimizing data movement by performing in-array multiplication. Additionally, the NAND flash structure is highly scalable, supporting deep and wide neural networks.

A technique that can be used to reduce the computational complexity of the inference process is by use of a Binarized Neural Network (BNN), in which a neural network works with binary weights and activations. A BNN (also called an XNOR-Net) computes the matrix-vector multiplication with “binary” inputs {−1, 1} and “binary” weights {−1, 1}. FIG. 9 is a table illustrating the output of a binary neural network in response to the different input-weight combinations. As shown in the right-most column, when the input and weight match, the output is 1; and when the input and the weight differ, the output is −1. FIGS. 10-13 illustrate an embodiment for the realization of a neural network with binary-input and binary-weights in an SLC NAND array.

FIG. 10 illustrates an embodiment for a unit synapse cell for storing a binary weight in a pair of series connected memory cells FG1 and FG2. In this example, each of the memory cells is an SLC cell storing one of two states and can be part of a larger NAND string. The memory cells FG1 and FG2 can be flash memory cells and are programmed or erased by respectively adding or removing electrons from a charge storing layer or a floating gate, and are sensed by applying corresponding voltages V1 and V2 to their control gates. When the memory cells FG1 and FG2 are part of a larger NAND string that includes additional unit synapse cells or other memory cells, the pair of memory cells can be adjacent on the NAND string or separated by other memory cells forming the NAND string. In the following discussion, the individual memory cells of a unit synapse cell will be represented as being adjacent, but other arrangements are possible depending on the embodiment. For example, the upper half of a NAND string could hold the first memory cell of each unit synapse, with the second memory cell of each unit synapse in the lower half of the NAND string. For any of these arrangements, when sensing a given unit synapse, the other memory cells and select gates on the same NAND string will be biased such that both of the memory cells of the non-selected unit synapses and any other memory cells, along with the select gates, are conducting.

FIG. 11 illustrates the distribution of threshold voltages for the storage of data states on an SLC memory. In this embodiment, the erased negative threshold state is taken as the “1” state and the positive threshold state is taken as the “0” state. FIG. 11 illustrates a typical distribution of the threshold voltages of the memory cells of a set of memory cells, such as an erase block or whole array, after the memory cells have been erased (here assigned the “1” state) and the memory cells programmed to the positive threshold state (here assigned the “0” state). As discussed further with respect to FIGS. 12 and 13, a binary weight will have one memory cell of a unit synapse in the “0” state and the other memory cell in the “1” state. More generally, the “1” state need not be a negative threshold state as long as the two states correspond to a lower threshold state, here defined as the “1” state, and a higher threshold state, here defined as the “0” state.

For sensing the memory cells with the threshold distribution illustrated in FIG. 11, a first voltage level Vread is used to distinguish between the data states, so that if applied to the control gate of a memory cell, the memory cell will conduct if in the “1” state and not conduct if in the “0” state. For example, if the “1” states are a negative threshold voltage state and the “0” states are a positive threshold voltage state, Vread could be taken as 0V. A second sensing voltage Vpass is high enough such that a memory cell in either state will conduct. For example, Vpass could be a few volts. In the following, Vread will be defined as the “0” input voltage value and Vpass will be defined as the “1” input voltage value.
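A minimal sketch of this sensing behavior, assuming illustrative values (0V for Vread, 4V for Vpass, and example threshold voltages) and a simple threshold comparison rather than any particular device model:

```python
# Minimal model of SLC cell conduction under the two sensing voltages.
V_READ = 0.0   # "0" input voltage: only an erased ("1", low Vth) cell conducts
V_PASS = 4.0   # "1" input voltage: a cell in either state conducts

# Illustrative threshold voltages for the two data states.
VTH = {"1": -1.5,   # erased, low (negative) threshold
       "0": 2.0}    # programmed, high (positive) threshold

def cell_conducts(data_state, gate_voltage):
    # A cell conducts when the applied control gate voltage exceeds its threshold voltage.
    return gate_voltage > VTH[data_state]

assert cell_conducts("1", V_READ) and not cell_conducts("0", V_READ)
assert cell_conducts("1", V_PASS) and cell_conducts("0", V_PASS)
```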

In implementations of NAND flash memory, a number of different voltage levels are often used for sensing operations, both in program verify and read operations, for both SLC and MLC memory. For example, a program verify level for a given data state may be offset from the read voltage level for the same data state. Also, various levels may be used for pass voltages in different operations and conditions to place a memory cell in a conducting state independently of its stored data state. To simplify the following discussion, only the single Vread voltage will be used to differentiate between the data states and only the single Vpass voltage will be used when a memory cell or select gate is to be put into a conducting state for all stored data state values.

FIGS. 12 and 13 illustrate an embodiment for implementing a binary neural network using a pair of series connected SLC memory cells as a unit synapse. More specifically, FIG. 13 shows one embodiment for the correspondence between input logic, weight logic, and output logic of FIG. 9 and the input voltage patterns, threshold voltage Vth of the unit synapse's memory cells, and the output voltage, respectively. FIG. 12 is a schematic representation of the response of a unit synapse to the different cases.

In FIGS. 12 and 13, a logic input of −1 corresponds to the input voltage pattern of V1=Vpass=“1”, V2=Vread=“0”; and a logic input of +1 corresponds to the input voltage pattern of V1=Vread=“0”, V2=Vpass=“1”. A weight logic of −1 corresponds to the memory cell FG1 being in the “0” (programmed) state and FG2 being in the “1” (erased) state; and a weight logic of +1 corresponds to the memory cell FG1 being in the “1” state and FG2 being in the “0” state. An output logic of +1 corresponds to the unit synapse conducting a current Icell, resulting in an output voltage drop of ΔV across the unit synapse; and an output logic of −1 corresponds to the unit synapse not conducting, resulting in little or no output voltage drop across the unit synapse.

FIG. 12 schematically represents the four cases of input, weight pairs. In case 1, the input and weight both match with values of −1. The applied input voltage pattern applies the higher input voltage of Vpass, or “1”, to the upper cell with the higher Vth “0” data state and the lower input voltage of Vread, or “0”, to the lower cell with the lower Vth “1” data state, so that both cells are conductive and pass a current of Icell. In case 2, the input voltage pattern is reversed with respect to case 1, with the input logic now at +1 while the weight is at −1. This results in the lower Vread, or “0”, voltage level being applied to the top cell with the higher Vth, which consequently will not be conductive (as indicated by the X under the memory cell) and no appreciable current will flow through the pair.

For cases 3 and 4 on the bottom of FIG. 12, the weight value is now +1, with the lower Vth “1” state in the upper cell and the upper Vth “0” state programmed into the lower cell. In case 3, the −1 input voltage pattern is applied to the unit synapse, resulting in the lower cell not conducting as it receives the lower Vread, or “0”, voltage level. In case 4, the higher Vpass, or “1”, input is now applied to the lower memory cell, which consequently conducts, and the unit synapse passes the current Icell.

As represented in the embodiment of FIGS. 12 and 13, the use of a pair of series connected memory cells of FIG. 10 as a unit synapse can be used to implement the binary neural network logic table of FIG. 9. The unit synapses can be incorporated into larger NAND strings of multiple such series connected unit synapses. When sensing a selected unit synapse on a NAND string, other unit synapses on the same NAND string can be biased to be on by using a Vpass voltage, with the NAND string's select gates also biased to be on.
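The four input/weight cases of FIGS. 12 and 13 can be sketched as follows, reusing the simple cell-conduction model above (the encodings and helper names are illustrative assumptions, not taken from the embodiments):

```python
# Minimal sketch of a unit synapse: two series connected SLC cells (FG1, FG2).
# Weight -1: FG1 programmed ("0"), FG2 erased ("1").  Weight +1: FG1 "1", FG2 "0".
# Input  -1: V1 = Vpass, V2 = Vread.                  Input  +1: V1 = Vread, V2 = Vpass.

V_READ, V_PASS = 0.0, 4.0
VTH = {"1": -1.5, "0": 2.0}            # illustrative threshold voltages

def unit_synapse_conducts(weight, inp):
    fg1, fg2 = ("0", "1") if weight == -1 else ("1", "0")
    v1, v2 = (V_PASS, V_READ) if inp == -1 else (V_READ, V_PASS)
    # The series connected pair conducts only if both cells conduct.
    return (v1 > VTH[fg1]) and (v2 > VTH[fg2])

# The pair conducts (output +1) exactly when input and weight match, per the table of FIG. 9.
for w in (-1, +1):
    for i in (-1, +1):
        assert unit_synapse_conducts(w, i) == (w == i)
```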

The use of NAND flash memory to store weights and compute the dot products of inputs and weights in-array can be used in both the training and inference phases. The training phase can proceed as in the flow of FIG. 7A, where step 709 would erase and reprogram the weights as needed to adjust the weights until they are determined to be sufficiently accurate at step 707. The present discussion will mostly focus on the inference phase, where the weights have previously been determined in a training process and then loaded into a NAND memory by programming of the unit synapses to the determined binary weight values.

FIG. 14 illustrates the incorporation of the unit synapses into a NAND array, such as in the memory structure 326 of FIG. 5. FIG. 14 shows one block of what can be a larger array of many blocks, each with multiple NAND strings connected between a source line 1415 and a corresponding bit line BLi 1403 i. A typical NAND memory array will be formed of many such memory blocks. Each NAND string is formed of a number of memory cells connected in series between a source side select gate SSLi 1409 i, by which the NAND string is connected to the source line 1415, and a drain side select gate DSLi 1407 i, by which the NAND string is connected to the corresponding bit line BLi 1403 i.

The memory cells along each NAND string are paired into unit synapses of a pair of memory cells storing a weight W^(i,j), as illustrated by the unit synapse of FIG. 10. Each of the NAND strings can have one or more unit synapses connected in series, where the embodiment of FIG. 14 illustrates 32 unit synapses per NAND string. Each unit synapse can store a binary weight and is connected along a pair of word lines WL<j> 1405 j and WL′<j> 1405′j that receive a corresponding logic input Input<j> corresponding to the voltages of FIG. 13. The word line pairs WL<j> 1405 j and WL′<j> 1405′j span the columns of NAND strings of the block. In the embodiment of FIG. 14, the memory cells of a unit synapse are adjacent on the NAND string, but other arrangements can be used such that the memory cells of the synapses are interleaved rather than being contiguous; and although the discussion here is focused on binary weights using two SLC memory cells per synapse, other embodiments can use more memory cells per unit synapse, multi-level memory cells, or both, to store neural network weights with more than the two values of the binary example. Additionally, although the NAND strings in the shown embodiment are formed of charge storing, flash memory cells, other memory cells with the same array architecture can also be used.

The output of a unit synapse 1401 i,j storing weight W^(i,j) can be determined by applying an input logic voltage pattern to the corresponding input Input<j>, while the other memory cells and select gates of the selected NAND string are biased to be ON. Based on the input logic and weight logic, the unit synapse 1401 i,j storing weight W^(i,j) will either conduct or not, as represented in the table of FIG. 15, which can be determined by the corresponding sense amplifier SAi 1411 i. As discussed further below, for each bit line a corresponding counter-based digital summation circuit CSCi 1413 i can keep track of how many of the unit synapses along the bit line conduct in response to the inputs, summing these values, where the sense amplifiers and summation circuits can be part of the Sense Blocks 350 of FIG. 5. The same input Input<j> is applied concurrently to all of the unit synapses 1401 i,j storing weight W^(i,j) for all of the bit lines BLi 1403 i by biasing the corresponding select gates SSLi 1409 i and DSLi 1407 i to be on. Consequently, the same input can be applied to multiple synapses concurrently. The different synapses along the NAND strings can be selected sequentially for sensing, with the results along each bit line BLi 1403 i being accumulated by CSCi 1413 i. In a NAND memory, a page is the unit of read and program, where the read page and program page are usually taken to be the same, such as the whole of the memory cells connected along a word line or some portion of the memory cells along a common word line. For programming, the data of the unit synapses along a single word line would still be programmed word line by word line; however, relative to a standard NAND memory operation, where the goal is to determine the data content of the individual memory cells, the reading of a page of the binary weight unit synapses is performed in word line pairs such that the read page in this case can be taken as corresponding to a word line pair.

Referring back to FIG. 8, matrix multiplication is a multiple sum-of-product (dot-product) calculation for input-weight vector pairs (row-column of input matrixes) used for inferencing in a neural network. FIGS. 15 and 16 consider an example of the computation of a dot-product for the binary neural network algebra and how to implement this using a counter based summation digital circuit for an SLC NAND BNN embodiment. More specifically, although a binary neural network based on the logic illustrated by the table of FIG. 9 is based on the weights, inputs, and outputs having the values of either +1 or −1, when implemented by a NAND array as illustrated by FIG. 14, a sense amplifier will either register as conducting (“1”) or not conducting (“0”). Consequently, for the counter-based digital summation circuits CSCi 1413 i to accumulate the results to compute the dot-product of the matrix multiplication requires a conversion of the (+1, −1) based values to a (1, 0) basis, where the −1 values are replaced by 0.

The table of FIG. 15 considers the dot product of the example of an 8 element binary neural network input vector I^(bnn) across the top row and an 8 element binary neural network weight vector W^(bnn) in the second row when the vector elements are all quantized to −1/+1. The third row illustrates the element by element product of I^(bnn) and W^(bnn), equaling +1 when the two match and −1 when they differ. The dot product is then based on summing these bit by bit products to generate the dot-product P^(bnn_dec) of the two vectors. In the decimal system, the final correct result of adding up these values is calculated as P^(bnn_dec)=2.

On the top two rows of the table of FIG. 16, the input vector I^(bnn) and weight vector W^(bnn) are converted into the 1/0 binary basis for the same vectors as in FIG. 15. The third row of FIG. 16 illustrates the corresponding sense amplifier output, being the bit by bit XNOR value of the two vectors, which is 1 when the values match and 0 when the values differ. By accumulating these values from the sense amplifiers SAi 1411 i in the corresponding summation circuits CSCi 1413 i to determine their sum, this produces a popcount CNT^(bnn_out) corresponding to the number of 1 values. In the example of FIG. 16, CNT^(bnn_out)=5, which differs from the P^(bnn_dec)=2 value of FIG. 15 as the result of a mismatch in the input and weight is now a 0 rather than a −1.

To correct for this and determine P^(bnn_dec) in the binary system, a substitution of the output of the popcount operand CNT^(bnn_out) into Eq. 1 can be used to obtain a derived P^(bnn_dec):

P^(bnn_dec) = 2*CNT^(bnn_out) − S,  (Eq. 1)

where S is the size of the vector. In this example S=8, so that P^(bnn_dec)=2*5−8=2, which is the exact P^(bnn_dec)=2 value for the dot-product of FIG. 15.
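A sketch of this Eq. 1 conversion is shown below. The 8-element vectors used here are hypothetical (the actual entries of FIGS. 15 and 16 are not reproduced in the text) but are chosen so that, as in that example, the popcount is 5 and the signed dot-product is 2:

```python
# Sketch of the Eq. 1 conversion with illustrative 8-element vectors.
I_bnn = [+1, +1, +1, +1, +1, -1, -1, -1]
W_bnn = [+1, +1, +1, +1, +1, +1, +1, +1]
S = len(I_bnn)                                                 # vector size, S = 8

P_bnn_dec = sum(i * w for i, w in zip(I_bnn, W_bnn))           # signed dot-product (FIG. 15)
CNT_bnn_out = sum(1 for i, w in zip(I_bnn, W_bnn) if i == w)   # popcount of matches (FIG. 16)

assert CNT_bnn_out == 5
assert 2 * CNT_bnn_out - S == P_bnn_dec == 2                   # Eq. 1: P = 2*CNT - S
```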

FIG. 17 is a flowchart for one embodiment of a dot-product calculation using a binary neural network in inference, as illustrated in FIGS. 15 and 16. At step 1701, a first input value is applied to a weight of a first unit synapse to perform an in-array multiplication. Referring back to FIG. 14, this corresponds to applying an Input<j> value to a corresponding selected unit synapse 1401 i,j storing weight W^(i,j) on a bit line BLi 1403 i, for example Input<0> applied to the bottom-most unit synapse on BL0. At step 1703, the corresponding sense amplifier SAi 1411 i determines whether the NAND string is conducting (1) or not (0), corresponding to an XNOR-ing of the input and weight values. Step 1705 performs the accumulation, with the sensing result added to a CNT^(bnn_out) value maintained by the counter CSCi 1413 i. At step 1707, it is determined if there are more input/weight pairs to contribute to the dot-product, corresponding to another input/weight pair for the NAND string (or for other NAND strings on other blocks connected along the bit line) and, if so, the flow loops back to step 1701. If all the input/weight pairs have been computed and summed for the CNT^(bnn_out) of the dot product, the flow moves on to step 1709 to convert the popcount CNT^(bnn_out) value to the dot-product P^(bnn_dec) by use of Eq. 1. In the example of the tables of FIGS. 15 and 16, the S value for Eq. 1 would be 8, while for an entire NAND string as illustrated in FIG. 14 S=32. Note that the NAND array structure of FIG. 14 allows for the computation of a dot-product according to the flow of FIG. 17 to be performed concurrently along each bit line.

FIG. 18 illustrates an embodiment of a summation circuit for an SLC NAND array to support binary neural networks. More specifically, FIG. 18 repeats many of the elements of FIG. 14 in a somewhat simplified form, but also shows a word line decoder block 1811. The word line decoder 1811 receives the inputs, either a −1 or +1 input for a selected unit synapse, which are then translated into the corresponding voltage pattern for the word line pairs WL<j>, WL′<j> and applied to one of the word line pairs (those of the selected unit synapse). For non-selected unit synapses on the NAND string and for the select gates, the word lines and select lines will be set to be on, such as at the voltage level of Vpass. Based on these inputs, the counter-based summation digital circuits CSCi 1413 i of each of the bit lines can increase the count based on the output of the sense amplifier SAi 1411 i in the accumulation process.

FIG. 19 is a flowchart for one embodiment of a dot-product calculation using a binary neural network in inference, as illustrated in the tables of FIGS. 15 and 16 and the array architecture of FIG. 18. Beginning at step 1901, and referring to FIG. 18, the memory array receives an input Input<j> and translates this into a set of voltage values, corresponding to a −1 or +1 input value; and at step 1903 applies the voltage levels to a word line pair WL<j>, WL′<j> 1405 j, 1405′j. As the word lines span the NAND strings of the selected block, the process of FIG. 19 can be performed concurrently for any of the NAND strings for the unit synapses connected along the word line pair WL<j>, WL′<j> 1405 j, 1405′j. Additionally, in the NAND structure, the other elements of a selected NAND string (SSLi 1409 i, DSLi 1407 i, and the non-selected memory cells of the NAND string) will be biased to be on, such as by applying Vpass, at step 1905. Although listed as an ordered set of separate steps in FIG. 19, steps 1903 and 1905 are typically performed concurrently by the word line decoder 1811.

Step 1907 determines the conductivity of the set of memory cells of the selected unit synapse. As illustrated in the table of FIG. 15, the conductivity of the NAND string corresponds to the output logic value of the unit synapse in response to the input and can be determined by the sense amplifier SAi 1411 i. Based on the conductivity state of the unit synapse, at step 1909 the value of the count of the corresponding CSCi 1413 i is either incremented or not, as discussed above with respect to Eq. 1 and the table of FIG. 16.

Step 1911 determines if there are more input, weight pairs to add to the dot-product and, if so, the flow loops back to step 1901. Once the contributions of all of the input, weight pairs to the dot-products have been determined, the dot-product can be provided at step 1913. The set of dot-products determined at step 1913 can then serve as the input to a subsequent neural network layer or be the output of the inference process.

FIGS. 20 and 21 illustrate an example of a neural network and its implementation through a NAND array. In the process described above with respect to FIG. 19, the response to an input of one unit synapse along each bit line is determined based on whether the corresponding sense amplifier determines the unit synapse to conduct or not. For a given block, the contribution of each of the synapses along a NAND string is determined sequentially by the sense amplifiers.

FIG. 20 illustrates an example of three fully connected layers of four nodes each, so that the weight matrix between the layers is a 4×4 matrix. In FIG. 20, the inputs at the nodes are labelled as I^(l,i,n), where l is the layer index, i is the input index, and n is the neuron index. In the example of FIG. 20, three layers are shown, l=(0,1,2), and each has four nodes, n=(0,1,2,3). (The input index is used in some of the following examples of increased parallelism.) The weight matrices W^(l,n,n) connecting the layers are then 4×4 where the matrix multiplication to form the dot-products from the inputs of one layer to the next is:

I^(l+1,i,n) = I^(l,i,n) * W^(l,n,n).

The inputs of one layer are applied as voltage patterns on the word line pairs to the unit synapses to generate dot product values that are the inputs of the next layer.

FIG. 21 is a schematic representation of how these weight matrices are stored in the unit synapses of a NAND array for the in-array computations of matrix multiplication. Relative to FIG. 18, the block (here labelled Block 0) is represented in terms of the weights stored in the unit synapses, rather than the corresponding memory cell pairs, and the voltage level input patterns are represented as a single input, rather than the voltage levels applied to the corresponding word line pairs. The weight matrix between a pair of layers is then stored in a number of unit synapses along a number of NAND strings, where the number of unit synapses per NAND string and the number of NAND strings corresponds to the size of the weight matrix. In this example of 4×4 weight matrices, this corresponds to 4 unit synapses along 4 NAND strings. As represented in FIG. 21 these are 4 adjacent unit synapses on 4 adjacent bit lines, but these can be distributed across the block differently depending on the embodiment.

Relative to the representation of FIG. 20, a weight matrix is stored on the NAND array in a transposed form. For example, the weights from the different inputs of the first layer of FIG. 20 into the top node 2001 of the second layer are stored along the first NAND string connected to BL0; and the weights into the bottom node 2003 are stored along the fourth NAND string connected to BL3. To illustrate the correspondence, the reference numbers 2001 and 2003 are also used in FIG. 21 to illustrate the placement of the corresponding weights into these nodes.

To compute the different dot-products of the matrix multiplication, the data inputs are provided in a sequence of read commands. To compute the output of a single layer, the pages of weights are then read sequentially by the sense amplifiers over, in this example, four cycles (a sketch of this cycle-by-cycle computation follows the list):

-   cycle 1: achieve I^(0,0,0)*W^(0,0,0)
-   cycle 2: achieve I^(0,0,1)*W^(0,0,1)
-   cycle 3: achieve I^(0,0,2)*W^(0,0,2)
-   cycle 4: achieve I^(0,0,3)*W^(0,0,3)

where each of the cycles corresponds to a loop in the flow of FIG. 19 and different sensing orders can be used in different embodiments. The results of the cycles are sensed by the sense amplifier SA on each of the bit lines and accumulated in the CSCs, where the latency of the accumulation process is hidden under the concurrent multiply operations for the following cycles' reads. The output P^(n) from each bit line will then be the inputs I^(l+1,i,n) of the next layer.
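The sketch below walks through this cycle-by-cycle computation for one 4x4 layer. The input and weight values are hypothetical, the per-bit-line counters stand in for the CSCs, and Eq. 1 converts the accumulated popcounts back to signed dot-products:

```python
# Sketch of computing one 4x4 layer over 4 sensing cycles (one word line pair per cycle).
# weights[j][i] is the binary weight of the unit synapse on word line pair j, bit line i
# (the transposed storage of FIG. 21), and inputs[j] is the layer input applied on pair j.

def synapse_conducts(weight, inp):
    return weight == inp                     # unit synapse conducts iff input matches weight

def layer_outputs(inputs, weights):
    num_bit_lines = len(weights[0])
    counts = [0] * num_bit_lines             # one counter (CSC) per bit line
    for j, inp in enumerate(inputs):         # cycle j: sense word line pair j on all bit lines
        for i in range(num_bit_lines):
            counts[i] += 1 if synapse_conducts(weights[j][i], inp) else 0
    S = len(inputs)
    return [2 * cnt - S for cnt in counts]   # Eq. 1 per bit line: P = 2*CNT - S

# Hypothetical +1/-1 inputs and 4x4 weights for the first layer of FIG. 20.
inputs = [+1, -1, +1, +1]
weights = [[+1, -1, +1, -1],
           [-1, -1, +1, +1],
           [+1, +1, -1, -1],
           [+1, -1, -1, +1]]
print(layer_outputs(inputs, weights))        # the 4 dot-products feeding the next layer
```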

FIG. 22 illustrates an example of a neural network and its implementation through a NAND array to achieve a high parallelism across NAND blocks by leveraging multiple blocks within a single plane. In the process described above with respect to FIGS. 19 and 21, the response to an input of one unit synapse along each bit line is determined based on whether the corresponding sense amplifier determines the unit synapse to conduct or not. FIG. 22 considers an embodiment using a multi-bit sense amplifier, such as one that can distinguish between different current levels, allowing multiple blocks within a single plane to be sensed concurrently.

In a standard read operation where the object is to determine the data state stored in a memory cell, the determination is made by a sense amplifier based on a current or voltage level on a bit line based on whether or not the selected memory cell conducts. If multiple cells along a common bit line were sensed at the same time, where some conduct and some do not conduct, it would not be possible to determine which of the individual memory cells were the conducting cells and establish their corresponding data states. For the counter's output P^(n) from the matrix multiplication, however, it is only the sum of the number of unit synapses that conduct in response to the inputs that is of concern, not which of the individual synapses contribute. Consequently, the response of multiple unit synapses on different blocks in response to a corresponding set of inputs can be determined concurrently, thereby increasing parallelism, if the sense amplifier is able to determine the number of conducting synapses. By incorporating multi-bit sense amplifiers, the embodiment of FIG. 22 lets multiple unit synapses along a common bit line from differing blocks be sensed in parallel.
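A small sketch of why a multi-bit sense amplifier suffices here, assuming a simplified model in which each selected unit synapse contributes one unit of cell current when it conducts; the current model and the helper name are illustrative assumptions only.

    def multi_bit_sense(conducting_flags):
        """Hypothetical multi-bit sense amplifier: it cannot tell WHICH synapses
        on the shared bit line conduct, only HOW MANY, by resolving the total
        bit line current into one of len(conducting_flags)+1 levels."""
        return sum(conducting_flags)

    # Two unit synapses on the same bit line, one per block, sensed concurrently:
    # the accumulated count is all the matrix multiplication needs.
    count = multi_bit_sense([1, 0])   # block 0 conducts, block 1 does not
    print(count)                      # -> 1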

FIG. 22 is arranged similarly to FIG. 21 and is again shown storing the same 4×4 weight matrix connecting the first two layers of FIG. 20. FIG. 22 differs from FIG. 21 in that the weights are now distributed between two different blocks, here labelled Block 0 and Block 1, but these could be any two blocks of the same plane and the discussion can be extended to more than two blocks to further increase parallelism. As discussed above with respect to FIGS. 20 and 21, the weight matrix is again stored in a transposed form.

To perform a matrix multiplication, data inputs are provided in a sequence of read commands, but to compute the output of a single layer, multiple blocks are now read in parallel (one page of unit synapses per block). In the example of FIG. 22 for the matrices of FIG. 20, where two blocks are activated concurrently, an output of a layer can be computed within a 2-cycle latency:

-   cycle 1: achieve I^(0,0,0)*W^(0,0,0)+I^(0,0,2)*W^(0,0,2)
-   cycle 2: achieve I^(0,0,1)*W^(0,0,1)+I^(0,0,3)*W^(0,0,3)

where the accumulation for one cycle is performed while the multiply operations (reads) of the following cycle are carried out, so that the accumulation latency is hidden under the concurrent multiply operations.
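A sketch of the two-block schedule, reusing the hypothetical conducts() model from the earlier sketch and assuming a multi-bit sense amplifier that reports how many of the concurrently selected strings conduct; the scheduling loop illustrates the cycle count, not the actual command sequencing.

    import numpy as np

    def conducts(inp, weight):
        # Hypothetical unit-synapse model: conducts when input and weight match (XNOR).
        return 1 if inp == weight else 0

    rng = np.random.default_rng(3)
    I0 = rng.choice([-1, 1], size=4)
    W0 = rng.choice([-1, 1], size=(4, 4))

    # Weight rows 0 and 1 live in Block 0, rows 2 and 3 in Block 1 (per FIG. 22).
    schedule = [(0, 2), (1, 3)]            # cycle 1 reads rows 0 and 2; cycle 2 reads rows 1 and 3
    counters = [0, 0, 0, 0]
    for rows in schedule:                  # 2 cycles instead of 4
        for n in range(4):                 # each bit line returns a multi-bit count per cycle
            counters[n] += sum(conducts(I0[i], W0[i, n]) for i in rows)

    print(counters)                        # same P^(n) as the 4-cycle schedule, in half the cycles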

FIG. 23 is a flowchart for one embodiment of a dot-product calculation similar to FIG. 17, but that incorporates the multi-block parallelism illustrated by FIG. 22. Relative to step 1701, the parallel sensing of multiple blocks at step 2301 can now apply multiple inputs concurrently in each loop. At step 2303, the output of the sense amplifier is now a multi-bit value, rather than the binary value of step 1703, and corresponds to the number of conducting unit synapses along a bit line. The multi-bit value is then accumulated at step 2305, with the steps 2305, 2307, and 2309 corresponding to steps 1705, 1707, and 1709 of FIG. 17.

To further increase parallelism, the number of blocks sensed concurrently can be increased beyond the two shown in the example of FIG. 22, up to the total number of inputs for a layer. The degree of parallelism can be based on considerations including the amount of the resultant current that would be drawn and the level of resolution that can reasonably be achieved by the multi-bit sense amplifiers from the available current window.

FIG. 24 illustrates additional embodiments that can further increase parallelism by using an architecture that can perform inferencing for the inputs of a neural network concurrently across multiple planes. The multiple plane implementation can be used for sensing a single block at a time within each plane (as in FIG. 21) or for multiple blocks at a time within each plane (as in FIG. 22). The example of FIG. 24 is again based on the example of the network of FIG. 20 and uses two planes and two blocks within each plane, although both the number of planes and blocks can be extended.

FIG. 24 shows two planes, Plane 0 and Plane 1, for an embodiment where two blocks per plane are sensed concurrently, where the planes can be on a common die or on different die. For both of Plane 0 and Plane 1, the weights are stored as in FIG. 22 and the other elements are also repeated from FIG. 22. Where the planes differ is that the input index for the two planes differs, with inputs I^(0,0,n) for Plane 0 and the subsequent set of inputs to the layer of I^(0,1,n) for Plane 1.

In block-level parallelism, the memory can use multiple blocks of a single plane to compute one output of a single layer, where the read commands can be issued in parallel to access multiple blocks as described with respect to FIG. 22, with one page (of unit synapses) accessed per block in a cycle. By adding the plane-level parallelism of FIG. 24, multiple planes can be used to compute multiple outputs of a single layer, where the same weight matrix is stored in both planes and data can be provided to both planes in parallel. In the embodiment of FIG. 24, using 2 planes with 2 blocks/plane in parallel, the two outputs of a single layer can be computed within a 2-cycle latency, where the accumulation latency is hidden under multiplication (read command).
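A sketch of the plane-level parallelism, assuming two input vectors (input indices i = 0 and i = 1) applied to two planes that hold copies of the same weight matrix; as before, the conduction model and helper names are illustrative assumptions.

    import numpy as np

    def conducts(inp, weight):
        # Hypothetical unit-synapse model (XNOR of binary input and weight).
        return 1 if inp == weight else 0

    def plane_multiply(inputs, weights, schedule=((0, 2), (1, 3))):
        """One plane with two-block parallelism: 2 cycles per layer output."""
        counters = [0, 0, 0, 0]
        for rows in schedule:
            for n in range(4):
                counters[n] += sum(conducts(inputs[i], weights[i, n]) for i in rows)
        return counters

    rng = np.random.default_rng(4)
    W0 = rng.choice([-1, 1], size=(4, 4))   # same weight matrix stored in both planes
    I_00 = rng.choice([-1, 1], size=4)      # inputs I^(0,0,n) applied to Plane 0
    I_01 = rng.choice([-1, 1], size=4)      # inputs I^(0,1,n) applied to Plane 1

    # Both planes run concurrently, so two layer outputs finish within the same 2-cycle latency.
    print(plane_multiply(I_00, W0), plane_multiply(I_01, W0))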

Parallelism can also be increased through use of plane pipelining, where the output of one plane (corresponding to the matrix multiplication between one set of nodes) can be used as the input of another plane (corresponding to the matrix multiplication between the next set of nodes). Plane pipelining can further be combined with block-level parallelism, plane-level parallelism, or both to achieve even greater levels of parallelism.

FIG. 25 illustrates an embodiment of plane pipelining for different neural network layers. Referring back to the example of FIG. 20, the first stage in the pipeline stores the weight matrix between layers 0 and 1, and the next stage stores the weight matrix connecting layers 1 and 2. The example of FIG. 25 is for two stages, and also includes 2-plane parallelism and 2-block parallelism, but these are each independent aspects: more pipeline stages can be similarly incorporated, and the degree of both plane- and block-level parallelism can be higher when such additional parallelism is included. The planes can be formed on a single die or on multiple die.

At the top of FIG. 25, Plane 0,0 and Plane 0,1 are arranged as Plane 0 and Plane 1 for the embodiment of FIG. 24 and receive the inputs I^(0,0,n) for Plane 0,0 and I^(0,1,n) for Plane 0,1. Plane 0,0 and Plane 0,1 compute the outputs of layer-0 using block- and plane-level parallelism to generate the inputs I^(1,0,n) and I^(1,1,n) for the next stages in the pipeline of Plane 1,0 and Plane 1,1. In the lower part of FIG. 25, Plane 1,0 and Plane 1,1 are arranged as for the previous pipeline stage in Plane 0,0 and Plane 0,1, but now store the weight matrix entries W^(1,n,n) (again stored in transposed form) of the second layer rather than the W^(0,n,n) entries of the first layer. By supplying the outputs of the first stage to the second stage and applying the inputs I^(1,0,n) and I^(1,1,n) to the layer-1 matrix entries, the outputs of layer-1 are then computed.
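A sketch of the two-stage pipeline under the same assumed conduction model, chaining a plane-level multiply so that the counts produced by the layer-0 planes become the inputs of the layer-1 planes; the binarize() step, which maps a count back to a +1/-1 input for the next stage, is an assumption for illustration, since the activation applied between layers is not restated here.

    import numpy as np

    def conducts(inp, weight):
        return 1 if inp == weight else 0   # hypothetical XNOR unit-synapse model

    def plane_multiply(inputs, weights):
        # One plane with two-block parallelism (2 cycles), as in the earlier sketch.
        counters = [0, 0, 0, 0]
        for rows in ((0, 2), (1, 3)):
            for n in range(4):
                counters[n] += sum(conducts(inputs[i], weights[i, n]) for i in rows)
        return counters

    def binarize(counts, threshold=2):
        # Assumed activation mapping each count back to a +1/-1 input for the next stage.
        return [1 if c >= threshold else -1 for c in counts]

    rng = np.random.default_rng(5)
    W0 = rng.choice([-1, 1], size=(4, 4))   # stage 1: weights between layers 0 and 1
    W1 = rng.choice([-1, 1], size=(4, 4))   # stage 2: weights between layers 1 and 2
    I_00 = rng.choice([-1, 1], size=4)      # inputs I^(0,0,n) for Plane 0,0
    I_01 = rng.choice([-1, 1], size=4)      # inputs I^(0,1,n) for Plane 0,1

    # Stage 1 (Plane 0,0 / Plane 0,1) produces the layer-1 inputs; stage 2 (Plane 1,0 /
    # Plane 1,1) consumes them while stage 1 can start on the next set of inputs.
    I_10 = binarize(plane_multiply(I_00, W0))
    I_11 = binarize(plane_multiply(I_01, W0))
    print(plane_multiply(I_10, W1), plane_multiply(I_11, W1))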

It should be noted that the weights of different layers can be stored in the same block, same plane, or both, although this reduces the degree of parallelism as the matrix multiplication of the different layers would not be performed concurrently. This is illustrated by the embodiment of FIG. 26.

FIG. 26 illustrates an embodiment in which weights of different layers can be stored in the same block, same plane, or, in this case, both. More specifically, FIG. 26 shows one plane with the inputs for two layers on one plane, with the weights for each in the same block. In this example, the layer 1 weights that were in Plane 1,0 of FIG. 25 are now in the same blocks with the layer 0 weights that were in Plane 0,0 of FIG. 25. Thus, Block 0 in FIG. 26 includes the weights for Input&lt;0&gt; and Input&lt;1&gt; for both of layer 0 and layer 1, and Block 1 includes the weights for Input&lt;2&gt; and Input&lt;3&gt; for both of layer 0 and layer 1. The inputs I^(0,0,n) for layer 0 generate the outputs P^(n), which are the I^(1,0,n) for layer 1 and can be computed as described with respect to FIG. 22 in a first set of reads. The I^(1,0,n) then serve as the input for layer 1, again as described with respect to FIG. 22, but with the layer 1 weight matrix values W^(1,n,n), to generate the layer 1 outputs in a second set of reads.

The embodiments above present methods and architecture to realize the inference phase of a binary neural network with binary inputs and binary weights in a NAND memory structure. By use of two series connected memory cells as a unit synapse, binary weights of neural networks can be encoded and stored in a NAND memory array. These techniques allow for in-array implementations of matrix multiplication with improved inference accuracy when using binary neural networks for large datasets and complicated deep neural network (DNN) structures.

Relative to a standard NAND-based architecture, the described embodiments present a few small feature changes to the existing NAND memory architecture to support various levels of computing parallelism. For the program and erase operations, no circuit changes are needed. A modification is introduced on the row, block, and/or plane decoders for controlling read operations to sense weights stored on the two-cell unit synapses, as these use double word line selection with different voltage control and, for multi-block embodiments, multiple block selection. To detect 0 inputs, a modified counter-based summation digital circuit is introduced along with a zero input detection circuit. By introducing a multi-bit sense amplifier, parallel computation across blocks and planes can also be used.

According to a first set of aspects, an apparatus includes an array of non-volatile memory cells and one or more control circuits connected to the array of non-volatile memory cells. The array of non-volatile memory cells is arranged as NAND strings and configured to store a plurality of weights of a neural network, each weight stored in a plurality of non-volatile memory cells on a common NAND string. The one or more control circuits are configured to receive a plurality of inputs for a layer of a neural network, convert the plurality of inputs into a corresponding plurality of voltage patterns, apply the plurality of voltage patterns to the array of non-volatile memory cells to thereby perform an in-array multiplication of the plurality of inputs with the weights, and accumulate results of the in-array multiplication.

In additional aspects, an apparatus includes an array of memory cells, a word line decoder, and a multi-bit sense amplifier. The array of memory cells includes: a bit line; a source line; and a plurality of NAND strings, each including a plurality of memory cells and each connected between the bit line and the source line. The word line decoder is connected to the memory cells and configured to bias a first plurality of the NAND strings to perform a concurrent sensing operation on the first plurality of NAND strings. The multi-bit sense amplifier is connected to the bit line and configured to determine the number of the first plurality of NAND strings conducting in the concurrent sensing operation.

Further aspects include a method that includes receiving a plurality of input values and translating each of the plurality of input values into a corresponding voltage pattern. Each voltage pattern is one of a plurality of voltage patterns comprising a set of N voltage values. The plurality of voltage patterns is applied to one or more NAND strings connected to a shared bit line. No more than one of the voltage patterns is applied to any single one of the NAND strings at a time, and the set of N voltage values of each of the voltage patterns is applied to a corresponding N memory cells of the NAND string to which the voltage pattern is applied. The number of times that the one or more NAND strings conduct is determined in response to the plurality of voltage patterns being applied to the one or more NAND strings connected to the shared bit line.
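A minimal sketch of the input-to-voltage-pattern translation, assuming for illustration that N = 2 (one word line pair per unit synapse) and that a +1 input places a read voltage on the first word line of the pair and a pass voltage on the second, with the pattern reversed for a -1 input; the specific voltage values and the names V_READ and V_PASS are assumptions, not values taken from the specification.

    # Hypothetical word line voltages (in volts) for illustration only.
    V_READ = 0.0   # a voltage at which only an erased (low-Vt) cell conducts
    V_PASS = 5.0   # a voltage at which any cell conducts regardless of its state

    def input_to_voltage_pattern(binary_input):
        """Translate a binary input (+1 or -1) into the pair of voltages applied
        to the two word lines of a unit synapse."""
        return (V_READ, V_PASS) if binary_input == 1 else (V_PASS, V_READ)

    inputs = [1, -1, 1, 1]
    patterns = [input_to_voltage_pattern(x) for x in inputs]
    print(patterns)   # one 2-value voltage pattern per input, applied to a word line pair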

For purposes of this document, reference in the specification to “an embodiment,” “one embodiment,” “some embodiments,” or “another embodiment” may be used to describe different embodiments or the same embodiment.

For purposes of this document, a connection may be a direct connection or an indirect connection (e.g., via one or more other parts). In some cases, when an element is referred to as being connected or coupled to another element, the element may be directly connected to the other element or indirectly connected to the other element via intervening elements. When an element is referred to as being directly connected to another element, then there are no intervening elements between the element and the other element. Two devices are “in communication” if they are directly or indirectly connected so that they can communicate electronic signals between them.

For purposes of this document, the term “based on” may be read as “based at least in part on.”

For purposes of this document, without additional context, use of numerical terms such as a “first” object, a “second” object, and a “third” object may not imply an ordering of objects, but may instead be used for identification purposes to identify different objects.

For purposes of this document, the term “set” of objects may refer to a “set” of one or more of the objects.

The foregoing detailed description has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. The described embodiments were chosen in order to best explain the principles of the proposed technology and its practical application, to thereby enable others skilled in the art to best utilize it in various embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope be defined by the claims appended hereto.

What is claimed is:
1. An apparatus, comprising: one or more control circuits configured to connect to an array of non-volatile memory cells, the memory cells of the array arranged as NAND strings configured to store a plurality of weights of a neural network in a binary format with each of the weights stored in a pair of memory cells on a shared NAND string, the one or more control circuits configured to: receive a plurality of binary valued inputs for a layer of a neural network; and perform an in-array inference operation between the plurality of inputs for the layer of the neural network and the weights of the neural network.
2. The apparatus of claim 1, wherein, in performing the in-array inference operation, the one or more control circuits are configured to: convert each of the plurality of inputs into a corresponding one of a plurality of voltage patterns, each of the voltage patterns including a pair of voltage values; apply the plurality of voltage patterns to the array of non-volatile memory cells to thereby perform an in-array multiplication of the plurality of inputs with the weights; and accumulate results of the in-array multiplication.
3. The apparatus of claim 2, further comprising: a memory die comprising the memory array, wherein each of the plurality of weights is stored in a pair of non-volatile memory cells, one of which is in a programmed state and the other of which is in an erased state.
4. The apparatus of claim 3, wherein the array includes: a bit line; and a source line, wherein the NAND strings each include a plurality of memory cells and are each connected between the bit line and the source line, and wherein the one or more control circuits include: a word line decoder connected to the memory cells and configured to bias a first plurality of the NAND strings to perform a concurrent sensing operation on the first plurality of the NAND strings; and a multi-bit sense amplifier connected to the bit line and configured to determine a number of the first plurality of the NAND strings conducting in the concurrent sensing operation.
5. The apparatus of claim 4, the one or more control circuits further including: a counter connected to the multi-bit sense amplifier and configured to increment a count value by a number corresponding to the number of the first plurality of the NAND strings conducting in the concurrent sensing operation.
6. The apparatus of claim 5, wherein each of the concurrent sensing operations includes simultaneously sensing a plurality of memory cells on each of the first plurality of the NAND strings.
7. The apparatus of claim 3, wherein: the array of non-volatile memory cells includes a plurality of NAND strings connected to a shared bit line; and the one or more control circuits are further configured to concurrently apply the plurality of voltage patterns to the plurality of NAND strings connected to the shared bit line and accumulate the results of the in-array multiplication in a multi-bit sensing operation for the shared bit line.
8. The apparatus of claim 3, wherein: the array of non-volatile memory cells includes a plurality of NAND strings connected to a shared bit line; and the one or more control circuits are further configured to sequentially apply the plurality of voltage patterns to the plurality of NAND strings connected to the shared bit line and accumulate the results of the in-array multiplication in sequential sensing operations.
9. The apparatus of claim 3, wherein the array of non-volatile memory cells includes: a first plurality of NAND strings each connected to a corresponding bit line; and the one or more control circuits are further configured to concurrently apply a first of the plurality of voltage patterns to the first plurality of NAND strings and independently accumulate a result of the in-array multiplication for each of the first plurality of NAND strings concurrently.
10. The apparatus of claim 2, wherein the one or more control circuits are further configured to provide accumulated results of the in-array multiplication as inputs for a subsequent layer of the neural network.

11. A method, comprising: receiving one or more input values; translating each of the one or more input values into one or more voltage values; applying the one or more voltage values to a plurality of word lines of a non-volatile memory array, the array including a plurality of NAND strings, each including a plurality of memory cells connected to one of the word lines; while applying the one or more voltage values to the plurality of word lines of the array, performing a concurrent sensing operation on the plurality of NAND strings; and determining a number of the plurality of NAND strings that are conducting in the concurrent sensing operation.
12. The method of claim 11, further comprising: incrementing a count value by a number corresponding to the number of the plurality of NAND strings conducting in the concurrent sensing operation.
13. The method of claim 11, wherein performing the concurrent sensing operation includes: simultaneously sensing a plurality of memory cells on each of the plurality of NAND strings.
14. The method of claim 13, wherein each of the plurality of memory cells on each of the plurality of NAND strings stores a weight of a neural network.
15. The method of claim 14, wherein the weights are binary weights.
16. An apparatus, comprising: an array of non-volatile memory cells including a bit line, a plurality of word lines, and a first plurality of NAND strings each connected to the bit line and each including a plurality of memory cells each connected to a corresponding one of the word lines; and one or more control circuits connected to the word lines and the NAND strings, the one or more control circuits configured to: concurrently apply, for each of a first plurality of the NAND strings, one of a plurality of sensing voltages to one or more first word lines connected to a corresponding one or more memory cells; and determine a number of the first plurality of the NAND strings that conduct in response to concurrently applying, for each of the first plurality of the NAND strings, the one of the plurality of sensing voltages to the one or more first word lines connected to a corresponding one or more memory cells.
17. The apparatus of claim 16, wherein the one or more control circuits are further configured to: increment a count value by a number corresponding to the number of the first plurality of NAND strings conducting in response to concurrently applying the one of the plurality of sensing voltages to the one or more first word lines connected to a corresponding one or more memory cells.

18. The apparatus of claim 17, wherein the one or more control circuits are further configured to: subsequent to incrementing the count value, concurrently apply, for each of one or more of the first plurality of the NAND strings, one of a plurality of sensing voltages to one or more second word lines connected to a corresponding one or more memory cells; determine a number of the first plurality of the NAND strings that conduct in response to concurrently applying, for each of the first plurality of the NAND strings, the one of the plurality of sensing voltages to the one or more second word lines connected to a corresponding one or more memory cells; and further increment the count value by a number corresponding to the number of the first plurality of NAND strings conducting in response to concurrently applying the one of the plurality of sensing voltages to the one or more second word lines connected to a corresponding one or more memory cells.
19. The apparatus of claim 16, wherein each of the plurality of memory cells on each of the first plurality of NAND strings corresponds to a weight of a neural network.
20. The apparatus of claim 19, wherein the weights are binary weights.